

                          Files and formats


GENERAL FORMAT

All data files are in tab-separated format (extension .tsv). Cells in
a row are separated only by tabs and are never quoted. Any quotation
marks at the beginning or end of a cell are therefore part of the
data.
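
As a minimal sketch of what this means for a reader, the following
Python function splits one line into cells without any quote
handling. The function name split_row is ours, and stripping a
trailing carriage return is a precaution, not something the files are
known to need. Python's csv module would behave the same way only when
called with delimiter="\t" and quoting=csv.QUOTE_NONE.

```python
def split_row(line):
    """Split one line of a .tsv file into its cells.

    Cells are separated by tabs only and are never quoted, so a plain
    split is enough and quotation marks stay part of the data.
    """
    return line.rstrip("\r\n").split("\t")
```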

The number of columns and their meaning depend on the type of data in
the file. There are three different cases:

  Bible book files (in the Texts directory)
  Virtual Bible book files (in the Texts/*/Masks directories)
  Alignment files (in the Alignments directory)

These different files are described in turn below.


BIBLE TEXTS

The Bible texts are found in the Texts directory, which is organized
by language and translation. The directory names for the
translations follow the format ll_YYYY_Name, where ll is a language
prefix, YYYY the publication year, and Name a short identifier
relating to the title, publisher or translator.
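
A directory name in this format can be taken apart as sketched
below. The names Translation and parse_translation_name are
hypothetical, and we assume that the language prefix and the year never
contain an underscore (the Name part may).

```python
from typing import NamedTuple


class Translation(NamedTuple):
    language: str  # language prefix, e.g. "nl"
    year: int      # publication year
    name: str      # short identifier


def parse_translation_name(dirname):
    """Split a directory name of the form ll_YYYY_Name into its parts."""
    # Split on the first two underscores only, so that any underscore
    # inside the Name part is preserved.
    language, year, name = dirname.split("_", 2)
    return Translation(language, int(year), name)
```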

The Bible texts are given in a flat, plain text file format, in one
file for each book. The list of books, in order of appearance, is
given in a file called book_list. In a standard POSIX Unix
environment, the following therefore prints a complete Bible in order:

  $ <book_list xargs -I'{}' cat {}.tsv

In the .tsv files for each book, the first column contains the
book-chapter-verse codes that relate to the original material. The
second column contains the actual texts of the segments.

  Bk.ch.vs<tab>The quick brown fox jumped over the lazy dog.

The book codes (Bk) are taken from the list of abbreviations for the
OSIS XML schema (https://wiki.crosswire.org/OSIS_Book_Abbreviations),
and listed in the file 'OSIS book abbreviations.txt' in this
directory.

The chapter part (ch) of the book-chapter-verse code is a number or,
in some cases, a single letter. Chapter number 0 is reserved for
material that is not inside a chapter – either because it is non-verse
material (title, prologue, etc) that relates to the book rather than a
chapter, or because the book is not subdivided into chapters.

The verse part (vs) is a number. Verse number 0 is reserved for
non-verse segments (titles, prologues, chapter contents, etc). To
illustrate, here is the beginning of Genesis from
nl/nl_1939_Canisius/Gen.tsv:

  Gen.0.0 Genesis
  Gen.1.0 Hoofdstuk 1
  Gen.1.1 In het begin schiep God hemel en aarde.
  Gen.1.2 Maar de aarde was nog ongeordend en leeg, over de ...

An example of a book without chapters is nl/nl_1399_NNNT/3John.tsv:

  3John.0.0       [III JOHANNIS EPISTULA]
  3John.0.1       Ic, Johannes, die een olde bin, ontbiede den ...
  3John.0.2       Alre lieveste, van allen werken doe ic ghebede, ...

The book-chapter-verse codes are not guaranteed to be unique, so they
should not be used as identifiers.
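
The structure of the codes can be unpacked as in the following
sketch. The function names are ours. We assume that book codes contain
no '.' and that the verse part is always a plain number, as described
above. The chapter part is kept as a string because it may be a
letter.

```python
def parse_code(code):
    """Split a book-chapter-verse code into (book, chapter, verse).

    The chapter is returned as a string because it can be a number or
    a single letter. The verse is returned as an integer.
    """
    book, chapter, verse = code.rsplit(".", 2)
    return book, chapter, int(verse)


def is_verse(code):
    """Return True unless the code marks a non-verse segment (verse 0)."""
    return parse_code(code)[2] != 0
```

Since the codes are not unique, a segment is better identified by its
position in the book file than by the result of parse_code.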

We have in general not added any markup to these texts. The exception
is the (admittedly inconsistent) use of curly braces {...} to indicate
different kinds of highlighting, notably to mark expanded
abbreviations in certain Bibles. This markup is easily stripped, for
instance by something like:

  $ <Gen.tsv sed 's/[{}]//g'

Any other markup-like content comes from the source material, even
though it may have been an editorial addition at that stage. The use
of square brackets [...] in 3John.0.0 above is an example of this.


ALIGNMENT MASKS

The division into books facilitates aligning Bibles, since the text
parallelism at the book level is much greater than at the complete
Bible level (books may be missing, may come in different order,
etc). However, in certain cases, different Bibles put "the same" text
in different places. Choosing the book as the text unit for alignment
does not address this. Therefore, we use an "alignment mask", or
virtual book: these are manually reorganized parts of a Bible
translation, to match the organization of some other Bible better.

Following EDGeS, the Dutch Nieuwe bijbelvertaling (nl_2004_NBV) is the
reference translation in OpenEDGeS. Note, however, that nl_2004_NBV is
not included in OpenEDGeS due to copyright restrictions. It can be
searched in Opus as part of the complete EDGeS corpus, or it can be
consulted at a website like https://debijbel.nl/bijbel/NBV or in
printed form.

The mask files, which make translations more like nl_2004_NBV for the
purpose of aligning, are included in the OpenEDGeS release, since they
nevertheless increase parallelism between the Bibles and may offer a
starting point for further alignment efforts. The general idea is that
if we align two books from two Bibles, and there is a mask for one of
these books, we align the mask, rather than the actual
book. Afterwards, we use the information contained in the mask to
relate the outcome of alignment back to the actual book.

Mask files are found in the directories containing the Bible texts, in
subdirectories called 'Masks'. A mask file has 3 columns: 1. unique
references to the locations in the actual book, 2. the
book-chapter-verse codes, 3. the text segments.

  Bk.ln<tab>Bk.ch.vs<tab>The quick brown fox jumped over the lazy dog.

A unique reference consists of a book code (Bk) and the line number
(ln, 0-based) of the segment in the original book file. Note that given
this information, columns 2 and 3 are strictly superfluous, but they are
included for convenience and human readability.

Here is the beginning of the mask file for the Epistle of Jeremiah in
the King James version, en/en_1611_KJV/Masks/EpJer.tsv:

  Bar.154   Bar.6.1 A copy of an Epistle which Ieremie sent ...
  Bar.155   Bar.6.2 Because of {the} sinnes which ye haue ...

The Epistle of Jeremiah is given as chapter 6 of Baruch in
en_1611_KJV. The mask for EpJer moves this into its own book file
(since that is how nl_2004_NBV has it). A corresponding mask for the
book of Baruch (Bar) creates a virtual version of Baruch in the KJV
that consists of only the first 5 chapters.
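
The way a unique reference points back into the actual book can be
sketched as follows. The function names and the 'books' dictionary
(book code mapped to the list of rows of the actual book file, in file
order) are assumptions made for this illustration. We also assume that
book codes contain no '.'.

```python
def parse_ref(ref):
    """Split a unique reference Bk.ln into (book code, line number)."""
    book, line_number = ref.rsplit(".", 1)
    return book, int(line_number)


def resolve_ref(ref, books):
    """Return the [code, text] row of the actual book that ref points at.

    books maps a book code to the rows of that book file, each row
    being a list of two cells, with line numbers counted from 0.
    """
    book, line_number = parse_ref(ref)
    return books[book][line_number]


def mask_row_is_consistent(mask_row, books):
    """Check that columns 2 and 3 of a mask row repeat the book row."""
    ref, code, text = mask_row
    return list(resolve_ref(ref, books)) == [code, text]
```

With the example above, parse_ref("Bar.154") gives ("Bar", 154), that
is, line 154 (0-based) of Bar.tsv in en_1611_KJV.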

An empty mask file means that a book is excluded from alignment –
typically because its complete contents are reorganized into other
books. Reorganizing Additions to Daniel (AddDan) into separate
books for the prayer of Azariah (PrAzar), Susanna and the elders (Sus)
and Bel and the dragon (Bel) is an example of when this is used.


ALIGNMENTS

Finally, we supply alignment information for pairs of different Bible
translations. The alignment files are in the Alignments directory,
which is divided into subdirectories for the different language pairs,
and one for alignment against the pivot (see below). Alignment files
are named A-B.tsv, where A and B follow the ll_YYYY_Name pattern used
for Bible translations, and A lexicographically comes before B.
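
The file name for a given pair can be derived as in this small
sketch. The function name is hypothetical, and we assume that
"lexicographically" agrees with Python's default string ordering for
the names in this collection. The subdirectory in which the file is
found is not covered here.

```python
def alignment_filename(translation_1, translation_2):
    """Return the A-B.tsv name for two translations, in either order."""
    # A is the name that sorts first, B the one that sorts last.
    first, second = sorted([translation_1, translation_2])
    return f"{first}-{second}.tsv"
```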

As mentioned, we use the nl_2004_NBV translation as a
pivot. Alignments for any pair of translations in OpenEDGeS are
therefore the result of composing alignments against nl_2004_NBV. Even
though the texts for nl_2004_NBV are not part of OpenEDGeS, the
alignments against nl_2004_NBV are included in the distribution. They
can be found in the Alignments/Pivot directory. We refer to the LREC
paper included in this archive, and to the section on alignment masks
above, for details of the alignment procedure.
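
To give an impression of what composing through a pivot can look
like, here is one possible sketch. It is not taken from the paper and
the actual procedure may differ: we simply assume that segments of A
and B belong together whenever they are linked, directly or
indirectly, through shared pivot segments. The function name and the
input format (lists of (unit, pivot unit) pairs, where a unit is a
list of (book, line number) tuples) are our own.

```python
def compose_via_pivot(a_pivot, b_pivot):
    """Compose A-pivot and B-pivot alignments into A-B alignments.

    Segments that are connected through shared pivot segments are
    merged into one aligned unit (connected components, computed with
    a small union-find structure).
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(x, y):
        parent[find(x)] = find(y)

    # Link every segment of an aligned unit to its pivot segments.
    for side, rows in (("A", a_pivot), ("B", b_pivot)):
        for unit, pivot_unit in rows:
            nodes = [(side, ref) for ref in unit]
            nodes += [("P", ref) for ref in pivot_unit]
            for node in nodes[1:]:
                union(nodes[0], node)

    # Collect the A and B segments of each connected component.
    groups = {}
    for node in list(parent):
        side, ref = node
        if side == "P":
            continue
        group = groups.setdefault(find(node), {"A": [], "B": []})
        group[side].append(ref)

    # Keep only components that have material on both sides.
    return sorted(
        (sorted(g["A"]), sorted(g["B"]))
        for g in groups.values()
        if g["A"] and g["B"]
    )
```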

Alignment files contain two columns: one for the aligned units in the
A translation, and one for the corresponding units in the B
translation. An aligned unit, in turn, consists of one or more unique
references to segments from some book, separated by ',' (comma). The
unique reference is, as above, a combination of a book code and a
0-based line number. A 2-1 alignment would thus look like this:

  Bk.ln,Bk.ln<tab>Bk.ln

The identity of the A and B translations can be read from the file
name.

Here are two lines from the alignment file called
nl_1399_NNNT-nl_1939_Canisius.tsv:

  2Tim.76,2Tim.77   2Tim.75
  2Tim.78   2Tim.76

These lines align 3 verses from nl_1399_NNNT with 2 from
nl_1939_Canisius in the following way:

  2Tim.4.9  Want Demas hevet my ghelaten, mynnende dese werlt, ...
  2Tim.4.10 Crescens is gheseynt in Galaciam, Tytus in Dalmaciam.
  -
  2Tim.4.10 Want Demas, ...; Crescens naar Galátië, ... Titus ...
  
  2Tim.4.11 Lucas is alleen mit my. Nemet Marcum ende brenghet ...
  -
  2Tim.4.11 Alleen Lukas is bij me gebleven. Haal Markus op, ...
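
Reading a line of an alignment file can be sketched as follows. The
function names are ours, and we assume that book codes contain neither
',' nor '.'.

```python
def parse_unit(cell):
    """Turn one aligned unit into a list of (book, line number) pairs."""
    refs = []
    for ref in cell.split(","):
        book, line_number = ref.rsplit(".", 1)
        refs.append((book, int(line_number)))
    return refs


def parse_alignment_row(line):
    """Return the aligned units of the A and B translations for a line."""
    a_cell, b_cell = line.rstrip("\r\n").split("\t")
    return parse_unit(a_cell), parse_unit(b_cell)
```

For the first example line above, this gives the line numbers 76 and
77 of 2Tim in nl_1399_NNNT, and line number 75 of 2Tim in
nl_1939_Canisius.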

Insertions and deletions are not included in the alignment files:
there are no 1-0 or 0-1 alignments given. These can, however, easily
be recreated by looking for unaligned verse lines in the Bible
books. Only verses are aligned. Nevertheless, the line numbers refer
to the lines in the complete book text files, that is, including
non-verse segments. It is important to note that the use of an
alignment mask is not directly visible in the alignment files, since
all unique references have been resolved: they point into the actual
book text files, and never at the masks.
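
Recreating the insertions and deletions for one book can be sketched
as below. The function name and the input format are assumptions: the
rows of the actual book file in file order, and the set of 0-based
line numbers of that book that occur in the relevant column of the
alignment file.

```python
def unaligned_verses(book_rows, aligned_line_numbers):
    """Return the 0-based line numbers of verses that are not aligned.

    book_rows is the list of (code, text) rows of the actual book file.
    Non-verse segments (verse number 0) are skipped, since only verses
    are aligned.
    """
    result = []
    for line_number, (code, _text) in enumerate(book_rows):
        verse = int(code.rsplit(".", 1)[1])
        if verse != 0 and line_number not in aligned_line_numbers:
            result.append(line_number)
    return result
```

Because the references point into the actual book files, the same
check applies to books that were aligned through a mask.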
